Supervised Learning - Foundations: ReCell

Marks: 60

Context

Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses that are looking to save money when purchasing a device. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.

Objective

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.

Data Description

The data contains the different attributes of used/refurbished phones and tablets. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Observation :

Observation

Exploratory Data Analysis (EDA)

Plots for numerical columns

Observation:

  1. The column is heavily skewed.

Observation:

  1. The column is heavily skewed.

Observation:

  1. The column is heavily skewed.

Observation:

  1. The column is heavily skewed.

Observation:

  1. The mean and median are almost the same.

Observation:

  1. The column is heavily skewed.
  2. The mean and median are close to each other.

Observation:

  1. The column is heavily skewed.

Observation:

  1. The distribution is skewed to the left.

Observation:

  1. The column is heavily skewed.

Observation:

  1. The column is heavily skewed.

Plots for categorical columns

Observation:

  1. 14.5% of the devices belong to other brand names.

Observation:

  1. 93.1% of the devices run Android.

Observation:

  1. 67.6% of the devices support 4G.

Observation:

  1. 95.6% of the devices do not support 5G.

Questions:

  1. What does the distribution of used device prices look like?
  2. What percentage of the used device market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
  4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
  6. Budget devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget devices offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the price of a used device?

Question 1. What does the distribution of used device prices look like?

Observation:

  1. The distribution of used device prices is roughly bell-shaped.
  2. It is not a perfect normal distribution: it is skewed to the right.
  3. Apart from this skew, the distribution is approximately normal.
  4. There are outliers.

Question 2. What percentage of the used device market is dominated by Android devices?

Observation:

  1. Android devices account for 93.1% of the data.
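A share like the one above can be read directly from normalized value counts. The snippet below is a minimal sketch on a toy series; the column name `os` and the ten sample values are assumptions for illustration and are not the project data.

```python
import pandas as pd

# Toy stand-in for the operating system column (hypothetical values, not the real data)
os_column = pd.Series(["Android"] * 9 + ["iOS"], name="os")

# normalize=True returns proportions; multiply by 100 to get percentages
os_share = os_column.value_counts(normalize=True) * 100
print(os_share.round(1))
```

On the real dataset, the same call on the operating system column would give the percentage quoted in the observation.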

Question 3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?

Observation: As shown in the above plot

  1. OnePlus offers the highest RAM capacity, at 6 GB and above.
  2. Oppo and Vivo phones follow, with 5 GB or less.
  3. Celkon offers the least RAM.

Question 4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?

Observation :

Question 5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?

Observation:

  1. Almost all brands offer devices with screens larger than 6 inches.
  2. Huawei has the largest share of these devices, at 13%.
  3. Samsung has the second-largest share, at 10%.
  4. Microsoft has the smallest share.

Question 6. Budget devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget devices offering greater than 8MP selfie cameras across brands?

Question 7. Which attributes are highly correlated with the price of a used device?

Observation: Checking the correlation of the target variable (used_price) with the independent variables

  1. used_price is highly correlated with new_price.
  2. used_price is least correlated with days_used.
Summary of the EDA

Data Description:

Observations from EDA:

Data that requires Preprocessing:

Data Preprocessing

1. Missing value treatment:

The histograms and box plots above show that the numerical columns are heavily skewed. Because the median is less sensitive to skew and outliers than the mean, we use median values in place of missing values.

Observation

  1. Column 'main_camera_mp' has 179 missing values.
  2. Column 'selfie_camera_mp' has 2 missing values.
  3. Columns 'int_memory' and 'ram' each have 4 missing values.
  4. Column 'battery' has 6 missing values and 'weight' has 7.

Observation : All the missing values have been imputed with respective median values
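The imputation step can be sketched as follows. This is a minimal example on a toy frame that assumes a plain column-wise median; the original notebook may have grouped by another column before taking medians, which the text does not specify.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values; column names taken from the text)
df = pd.DataFrame({
    "main_camera_mp": [13.0, np.nan, 8.0, 5.0, np.nan],
    "battery": [3000.0, 4000.0, np.nan, 5000.0, 3500.0],
    "weight": [150.0, 180.0, 200.0, np.nan, 165.0],
})

# Count missing values per column before imputation
print(df.isnull().sum())

# Fill each numeric column with its own median (assumption: column-wise median)
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Confirm nothing is missing afterwards
print(df.isnull().sum())
```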

Feature engineering

Log Transformation

Observation: The columns weight, new_price, and used_price are considered for log transformation because they are highly skewed.

Log transformation

Observation :

  1. The columns of interest ('weight', 'new_price', and 'used_price') have been log-transformed and now show approximately normal distributions.
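The effect of the transformation can be checked by comparing skewness before and after. The sketch below uses synthetic right-skewed data as a stand-in for the three columns; the `_log` suffix for the new columns is a naming assumption.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for the real columns (assumption)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "weight": rng.lognormal(mean=5.0, sigma=0.4, size=500),
    "new_price": rng.lognormal(mean=5.5, sigma=0.6, size=500),
    "used_price": rng.lognormal(mean=4.5, sigma=0.6, size=500),
})

# Natural log of each strictly positive column, stored alongside the original
for col in ["weight", "new_price", "used_price"]:
    df[col + "_log"] = np.log(df[col])

# Skewness close to 0 indicates a roughly symmetric distribution
print(df[["weight", "new_price", "used_price"]].skew())
print(df[["weight_log", "new_price_log", "used_price_log"]].skew())
```

`np.log` requires strictly positive values. If a column could contain zeros, `np.log1p` would be a safer choice.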

Outlier detection using IQR

Plotting individual IQR plots only for the columns of interest: 'weight', 'new_price', and 'used_price'

Observation :

Outlier Treatment

Rechecking after trimming outliers in weight, new_price, and used_price

Rechecking Outlier detection using IQR

Plotting individual IQR plots only for the columns of interest: 'weight', 'new_price', and 'used_price'

Observation: Outliers in the columns of interest (weight, new_price, and used_price) have been trimmed.
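The IQR rule flags values that lie more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below applies it to a toy column and drops the flagged rows. The helper `iqr_bounds` is a hypothetical name, and dropping rows is an assumption about what "trimming" means here; capping values at the bounds with `Series.clip` is a common alternative.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy column with one obvious outlier (hypothetical values)
df = pd.DataFrame({"weight": [140, 150, 155, 160, 165, 170, 175, 180, 185, 900]})

lower, upper = iqr_bounds(df["weight"])
is_outlier = (df["weight"] < lower) | (df["weight"] > upper)
print("Outliers found:", int(is_outlier.sum()))

# Keep only the rows inside the fences
trimmed = df.loc[~is_outlier].reset_index(drop=True)
print(trimmed["weight"].describe())
```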

EDA Post Data Preprocessing

Observation: The object columns are brand_name, OS, 4g, and 5g.

Observation : All the object columns have been converted to categorical columns
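The conversion itself is a one-line cast per column. The sketch below uses a toy frame; the lowercase column name `os` and the sample values are assumptions.

```python
import pandas as pd

# Toy frame (hypothetical values); "os" as the column name is an assumption
df = pd.DataFrame({
    "brand_name": ["Samsung", "Huawei", "Others", "Samsung"],
    "os": ["Android", "Android", "Others", "Android"],
    "4g": ["yes", "yes", "no", "yes"],
    "5g": ["no", "no", "no", "yes"],
    "ram": [4.0, 6.0, 2.0, 8.0],
})

# Cast each text column to pandas' categorical dtype
cat_cols = ["brand_name", "os", "4g", "5g"]
for col in cat_cols:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Categorical columns use less memory than plain strings and make the set of levels explicit before dummy encoding.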

Univariate Analysis Post Data Preprocessing

Weight:

Observation :

New_Price :

Observation :

Used Price:

Observation :

Bivariate Analysis against Target Variable 'Used_price'

Used price vs Brand name

Used Price vs OS

Used price vs 4g

Used price vs 5g

Used price vs main_camera_mp

Used price vs selfie camera mp

Used price vs Int memory

Used price vs ram

Used price vs weight

Used price vs Screen size

Used price vs Release year

Used price vs Days used

Used price vs New price

EDA Summary Post Data Preprocessing

Data Description:

Data Cleaning:

Observations from EDA:

Building a Linear Regression model

Observation

Model performance check

Checking model performance on train set

Checking model performance on test set
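The train and test checks rely on the same error metrics. The helper below is a small sketch of how RMSE, MAE, and MAPE can be computed with NumPy; the function name `regression_metrics` and the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and MAPE (in percent) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    return {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
        # MAPE divides by the true values, so they must be non-zero
        "MAPE": float(np.mean(np.abs(errors / y_true)) * 100),
    }

# Hypothetical actual and predicted values
metrics = regression_metrics([4.0, 5.0, 4.5, 5.0], [4.2, 4.9, 4.5, 5.5])
print(metrics)
```

Calling the helper once on the training predictions and once on the test predictions gives two comparable sets of numbers. Similar values on both sets suggest that the model is not overfitting.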

Observations

Checking Linear Regression Assumptions

No Multicollinearity

Linearity of variables

Independence of error terms

Normality of error terms

No Heteroscedasticity

MULTICOLLINEARITY TEST

We will test for multicollinearity using VIF.

General Rule of thumb:

Observation

Removing Multicollinearity

Observation :

No remaining predictor has a VIF above 5. Hence, we can confirm that the assumption of no multicollinearity is satisfied.

Dropping variables with p-values > 0.05

Dropping high p-value variables

Observation:

  1. All variables with p-values above 0.05 have been removed.
  2. The adjusted R-squared of the new model has dropped only slightly compared with the previous model.
  3. Hence, we can conclude that the dropped variables had little effect on the model and that it is a good model.

Linearity of variables

Observation :

We see no pattern in the plot above. Hence, the assumptions of linearity and independence are satisfied.

TEST FOR NORMALITY

Null hypothesis: Residuals are normally distributed

Alternate hypothesis: Residuals are not normally distributed

Let's check the shape of the residual

The histogram is bell-shaped, which suggests that the residuals are approximately normally distributed.

Let's check the Q-Q plot.

Let's use the Shapiro-Wilk test for normality.
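The test can be run with `scipy.stats.shapiro`. The sketch below applies it to two synthetic samples, one normal and one clearly skewed, to show how the p-value behaves; in the analysis it would be applied to the model residuals instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Synthetic residual-like samples (assumption): one normal, one strongly skewed
residuals = rng.normal(0, 0.2, 500)
skewed = rng.exponential(1.0, 500)

# A small p-value rejects the null hypothesis that the sample is normally distributed
statistic, p_value = stats.shapiro(residuals)
stat_skewed, p_skewed = stats.shapiro(skewed)

print("normal sample: W =", round(statistic, 4), "p =", p_value)
print("skewed sample: W =", round(stat_skewed, 4), "p =", p_skewed)
```

With large samples the test is sensitive to small departures from normality, so the histogram and Q-Q plot are worth reading alongside the p-value.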

TEST FOR HOMOSCEDASTICITY

If we get a p-value greater than 0.05, we can say that the residuals are homoscedastic. Otherwise, they are heteroscedastic.

Goldfeld-Quandt test

Observation:

  1. The p-value is greater than 0.05.
  2. Hence, we can conclude that the residuals are homoscedastic and the assumption is satisfied.

Final Model Summary

Now that we have checked all the assumptions of linear regression and they are satisfied, we can move towards the prediction part.

Note: As the number of records is large, for representation purpose, we are taking a sample of 25 records only.

Observations

The model is able to explain ~83% of the variation in the data, which is very good.

The train and test RMSE and MAE are low and comparable. So, our model is not suffering from overfitting.

The MAPE on the test set suggests we can predict within ~4.3% of the used_price.

Hence, we can conclude the model olsmod2 is good for prediction as well as inference purposes.

Let's recreate the final statsmodels model and print its summary to gain insights.

Insights & Conclusion

Recommendations